Your goal is to forecast revenue, and you’d like to understand if their are relationships between Revenue trends / spikes and webpage traffic (do spikes in revenue relate to traffic being driven to specific pages?).
library(tidyverse)
library(timetk)
library(lubridate)
library(DataExplorer)
Read in the following data sets.
transactions_tbl <- read_rds("../00_data/transactions_weekly.rds")
transactions_tbl
## # A tibble: 91 x 2
## purchased_at revenue
## <date> <dbl>
## 1 2018-06-03 32736.
## 2 2018-06-10 38290.
## 3 2018-06-17 39974.
## 4 2018-06-24 35622.
## 5 2018-07-01 28984.
## 6 2018-07-08 47251.
## 7 2018-07-15 31512.
## 8 2018-07-22 50986.
## 9 2018-07-29 41183.
## 10 2018-08-05 36598.
## # … with 81 more rows
google_analytics_by_page_tbl <- read_rds("../00_data/google_analytics_by_page_daily.rds")
google_analytics_by_page_tbl
## # A tibble: 4,685 x 6
## date pagePath pageViews organicSearches sessions rank
## <date> <chr> <dbl> <dbl> <dbl> <int>
## 1 2019-05-08 / 246 29 125 1
## 2 2019-05-08 /business/2016/08/07/Cus… 38 27 32 11
## 3 2019-05-08 /business/2017/09/18/hr_… 38 28 35 8
## 4 2019-05-08 /business/2017/10/16/sal… 44 30 36 12
## 5 2019-05-08 /business/2017/12/27/six… 28 14 22 9
## 6 2019-05-08 /business/2018/10/08/pyt… 36 28 31 13
## 7 2019-05-08 /business/2019/03/11/ab-… 20 15 16 10
## 8 2019-05-08 /learn.html 12 1 6 19
## 9 2019-05-08 /p/ds4b-101-r-business-a… 8 1 5 14
## 10 2019-05-08 /p/jumpstart-with-r 71 5 28 6
## # … with 4,675 more rows
transactions_tblplot_time_series() to visualize purchased at vs revenuetransactions_tbl %>%
plot_time_series(purchased_at, revenue)
ANSWER: Weekly
There are 20 pages x 3 Metrics (pageViews, organicSearch, and sessions). To make it easier to visualize, do this:
google_analytics_by_page_tblplot_time_series() with .interactive = FALSE and .facet_ncol = 4 (this helps to visualize all of the facets)google_analytics_by_page_tbl %>%
group_by(pagePath) %>%
plot_time_series(date, pageViews, pagePath, .interactive = FALSE, .facet_ncol = 4)
We need to aggregate both data sets to get them on a common frequency before we can join them
transactions_tblsummarise_by_time() with .by = "week", and sum() the revenue.transactions_weekly_tbltransactions_weekly_tbl <- transactions_tbl %>%
summarise_by_time(purchased_at, "week", revenue = sum(revenue))
google_analytics_by_page_tblsummarise_by_time() with .by = "week", and sum() the sessions.pivot_wider() to pivot the names from “pagePath” and values from “sessions” to a wide data. Use names_prefix = "sessions_" to identify the new columns as coming from sessions.contains("/p/") (these are product pages)product_page_sessions_weekly_tblproduct_page_sessions_weekly_tbl <- google_analytics_by_page_tbl %>%
select(date, pagePath, sessions) %>%
group_by(pagePath) %>%
summarise_by_time(date, "week", sessions = sum(sessions)) %>%
pivot_wider(
names_from = pagePath,
values_from = sessions,
names_prefix = "sessions_"
) %>%
select(date, contains("/p/"))
left_join() to join transactions_weekly_tbl and product_page_sessions_weekly_tbltransactions_product_page_sessions_weekly_tbltransactions_product_page_sessions_weekly_tbl <- transactions_weekly_tbl %>%
left_join(
product_page_sessions_weekly_tbl,
by = c("purchased_at" = "date")
)
plot_missing() to inspect the missing data in transactions_product_page_sessions_weekly_tbltransactions_product_page_sessions_weekly_tbl %>% plot_missing()
# Load Checkpoint Data
transactions_product_page_sessions_weekly_tbl <-
read_rds("challenge_01_data_checkpoints/transactions_product_page_sessions_weekly_tbl.rds")
transactions_product_page_sessions_weekly_tblpivot_longer() everything except “purchased_at” to form the data for plotting. This creates 2 columns, “name” and “value”.plot_time_series() to visualize “purchased_at” vs “value”. Use `.facet_ncol = 3transactions_product_page_sessions_weekly_tbl %>%
pivot_longer(-purchased_at) %>%
group_by(name) %>%
plot_time_series(purchased_at, value, .facet_ncol = 3)
Several of the columns have a lot of missing data. These are pages that did not exist until recently in the data, and unfortunately we aren’t going to be able to use them because they will result in data too few rows for the analysis.
transactions_product_page_sessions_weekly_tblends_with() “with-r/”. These columns have very low data.transactions_product_page_subset_tbltransactions_product_page_subset_tbl <- transactions_product_page_sessions_weekly_tbl %>%
select(-contains("bundle"), -ends_with("with-r/"))
drop_na()mutate() and across():
log1p()standardize_vec()log_standardized_transactions_product_page_tbllog_standardized_transactions_product_page_tbl <- transactions_product_page_subset_tbl %>%
drop_na() %>%
mutate(across(-purchased_at, .fns = log1p)) %>%
mutate(across(-purchased_at, .fns = standardize_vec))
Visualize cross correlations between revenue and anything that contains("session"):
log_standardized_transactions_product_page_tblplot_acf_diagnostics() with:
.ccf_vars = contains("session").show_ccf_vars_only = TRUE.facet_ncol = 2log_standardized_transactions_product_page_tbl %>%
plot_acf_diagnostics(
purchased_at,
revenue,
.ccf_vars = contains("session"),
.show_ccf_vars_only = TRUE,
.facet_ncol = 2
)
What can you say about the relationship between 101 Course Page visits & revenue?
What about the relationship between Jumpstart, Learning Labs PRO page visits and Revenue?
Revenue is positively correlated for roughly 2 days after sessions to 101 course page visits. Perhaps students take a day or two to think about the decision. There may be an opportunity to follow up with those that do not enroll.
Learning Labs have a longer period (up to 7 lags) of a relationship with revenue. Jumpstart dips right away (as students take the free class), but then rises through lag 7 to 10. Perhaps this is when students finish the course and then make a purchase.